Abstract: The credibility of a web-based document is an
important concern when a large number of documents are
available on the internet for a given subject. In this paper,
various criteria that affect the credibility of a document are
explored, and an attempt is made to automate the process of
assigning a credibility score to a web-based document. At
present, the prototype of the tool is restricted to four criteria:
type of website, date of update, sentiment analysis and a
pre-defined Google page rank. A separate module for checking
the "link integrity" of a website is also developed. To obtain
empirical validity of the tool, a pilot study is conducted that
collects credibility scores for a set of websites from human
judges. The correlation between the scores given by the human
judges and the scores obtained by the tool is low. There are
two possible reasons for the low correlation: first, the tool is
restricted to only four criteria, and second, the subjects did
not agree among themselves. Apparently they judged the
websites on different criteria rather than on an overall
weighting. Further enhancement of this work, by including all
the criteria discussed in this paper when calculating the
credibility score of a website, could be of great use to a
novice user who wishes to find a reliable web-based document
on a specific topic.

Index Terms: credibility score; source evaluation;
automation; web-based multiple documents

I. Introduction 

A. Need for Source Evaluation of web-based Documents 

The number of web-based documents available for a given
subject is enormous, so it becomes necessary to evaluate the
sources of these documents in order to choose the most
appropriate ones. Studies conducted by Metzger et al. [1]
indicate that college students in particular rely heavily on
the web for both general and academic information but verify
that information very rarely. Another study by Walraven et al.
[2] also indicates that students seldom explicitly evaluate
'sources' and 'information' with respect to a web-based
document. Empirical psychological research shows that
adequately evaluating the credibility of sources is important:
students who assess reliability well and make use of source
characteristics achieve better comprehension of the content
[3]. Research by Amin et al. [4] demonstrates that providing a
novice user with credibility scores for web pages boosts the
user's confidence in selecting information, though it does not
make the search more efficient. Case studies [5] emphasize the
need for critical analysis of internet and scholarly sources by
undergraduates, as they are unable to discriminate between
credible and non-credible sources. In this paper we try to
automatically assign credibility indices to web-based
documents, based on different criteria, and report them back
to the users.

B. Various criteria of source evaluation 

The criteria on which a web-based source can be evaluated
are as follows:

- Type of website. A website can be an educational website
  (.edu, .ac), a government website (.gov), a commercial
  website (.com), an organizational website (.org) or another
  type. Depending on its type, a website can be more or less
  reliable.
- Date of the website. When the website was last updated also
  affects the credibility of a web-based document.
- Whether the web document is a primary or a secondary source
  [6].
- Availability of contact information (address and/or email id)
  for the owner of the web document.
- Link integrity of the website [7]. A website with balanced
  internal and external links is more credible, and it should
  not have any broken links.
- Any affiliation shown in the header or footer of the website,
  if available [8].
- Completeness and accuracy of the information.
- The author's expertise in the subject [9].
- Whether the author's opinion is biased or unbiased. Analyzing
  the sentiment (positive, negative or neutral) of a website
  can help determine the author's opinion.
- The author's connection to the source of publication and to
  the intended audience [6].
- Whether the author's point of view is objective and impartial
  [6].
- The author's credentials, such as institutional affiliation
  (where he or she works), educational background, past
  writings, or experience [6].
- Purpose of the web page (somewhat reflected by the type of
  website).
- Interactivity and usability of the website.
- Whether the structure of the website, in terms of graphics
  and text, is appropriate [7].
- Quality of the information on the website: elementary,
  technical, or advanced.
- "Tone" of the web page: ironic, humorous, exaggerated or
  overblown arguments [10].
- Whether advertising and informational content are supplied by
  the same person or organization. If so, the advertising is
  likely to bias the informational content [8].
- Any software requirements that may limit access to the web
  information [11].
- Whether the website is better than others and, if so, why.
- Ranking of the web source by Google. Google has a patented
  page-ranking technology that can also serve as a criterion
  for source evaluation [12].
- Domain experts' view on the credibility of a web document.
  According to Amin [13], experts develop different strategies
  while seeking information on the internet, and these
  strategies can be helpful to a novice user. An expert can
  assess the credibility of a web document along two dimensions
  of credibility: trustworthiness (well-intentioned, unbiased)
  and expertise (knowledgeable, competent) [14].

C. Multi-Criteria Decision Analysis (MCDA)

When a decision is affected by more than one criterion, a
multi-criteria decision analysis is required. One method for
this analysis is "Potentially All Pairwise RanKings of all
possible Alternatives" (PAPRIKA), in which all possible
alternatives are ranked pairwise so as to identify all
dominated and undominated pairs. Dominating alternatives are
given higher priority [15]. In this paper we use this method
to define initial weights for the available criteria, which are
then used to define and compute the credibility indices.
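
As a minimal sketch of the pairwise comparison step only, and not of the full PAPRIKA procedure, the following Python fragment separates pairs of alternatives into dominated and undominated pairs. The alternative names and their integer category ranks (higher is better) are hypothetical and serve only as an illustration.

```python
from itertools import combinations


def dominates(a, b):
    """True if a is at least as good as b on every criterion and better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def classify_pairs(alternatives):
    """Split all pairs into dominated pairs (winner first) and undominated pairs."""
    dominated, undominated = [], []
    for (name_a, a), (name_b, b) in combinations(alternatives.items(), 2):
        if dominates(a, b):
            dominated.append((name_a, name_b))
        elif dominates(b, a):
            dominated.append((name_b, name_a))
        else:
            # Neither alternative dominates (identical alternatives also land here).
            undominated.append((name_a, name_b))
    return dominated, undominated


# Hypothetical category ranks for (type, date, sentiment, rank); higher is better.
alternatives = {
    "site_a": (3, 2, 1, 2),
    "site_b": (2, 2, 1, 1),
    "site_c": (1, 3, 1, 2),
}
dominated, undominated = classify_pairs(alternatives)
print(dominated)
print(undominated)
```

Here site_a dominates site_b, while the two pairs involving site_c are undominated because each alternative is better on a different criterion. In PAPRIKA it is the undominated pairs that require a judgment from the decision maker.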

II. Hypothesis 

It is assumed that the process of assigning credibility
indices to web-based documents can be automated on the basis
of various criteria for source evaluation. At present, only
four criteria are considered when computing the credibility
index of a website: type of website, date of update, sentiment
analysis and Google page rank. To examine whether this goal
has been achieved, the credibility index produced by the tool
is compared with the ratings of human judges.

III. Method 

A. Design 

A prototype of the tool is designed that takes the Google
search results or the Wikipedia external links for a given
topic and assigns a credibility score to each web document
based on different criteria. The weights (Table I) are assigned
using 1000 Minds [16], decision-making software that
implements the PAPRIKA method.



TABLE I
WEIGHTS ASSIGNED TO EACH CATEGORY

Criterion          Category              Weight    Points
Type of website    Gov                   37.6%     100
                   Edu, org              30.3%     80.58
                   Info, net             5.6%      14.89
                   Com                   0.9%      2.39
                   Others                0%        0
Date of update     Less than 1 year      13.2%     100
                   >= 1 yr and < 5 yr    6.4%      48.48
                   >= 5 yr               0.4%      3.03
                   Not available         0%        0
Sentiment          Neutral               21.4%     100
                   Positive              7.7%      35.98
                   Negative              0%        0
                   Not available         0%        0
Google rank        9-10                  27.8%     100
                   7-8                   24.4%     87.7
                   5-6                   6.8%      24.46
                   3-4                   2.6%      9.35
                   1-2                   0%        0

Based on the above weights, a credibility score is assigned
to each website. For example, an organizational website that
was recently updated, has a positive bias and is rated 7 by
Google ranking is given the following score:
(80.58 + 100 + 35.98 + 87.7)/4 = 76.065. The results (search
results with credibility scores) are displayed in tabular form
(see Figure 1). The user is allowed to edit any score or to
download all the scores. In addition, a separate module is
developed that checks the link integrity of a website address
entered by the user.
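
The scoring rule in the worked example, the unweighted mean of the four per-criterion point values from Table I, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the code of the prototype: the function names are hypothetical, the date thresholds follow our reading of Table I, unknown or unavailable values are assumed to score 0, and the website type is assumed to be read from the host name labels so that a host ending in gov.in counts as a government site.

```python
from urllib.parse import urlparse

# Points per category as read from Table I.
TYPE_POINTS = {"gov": 100.0, "edu": 80.58, "org": 80.58,
               "info": 14.89, "net": 14.89, "com": 2.39}
SENTIMENT_POINTS = {"neutral": 100.0, "positive": 35.98, "negative": 0.0}


def type_points(url):
    """Points for the website type, taken from the host name labels."""
    host = urlparse(url).hostname or ""
    # Assumption: scan labels right to left so that "gov.in" or "org.uk"
    # is classified by its "gov" / "org" label; unknown types score 0.
    for label in reversed(host.split(".")):
        if label in TYPE_POINTS:
            return TYPE_POINTS[label]
    return 0.0


def date_points(age_years):
    """Points for the time since the last update; None means not available."""
    if age_years is None:
        return 0.0
    if age_years < 1:
        return 100.0
    if age_years < 5:
        return 48.48
    return 3.03


def rank_points(page_rank):
    """Points for a Google rank on the 1-10 scale; None means not available."""
    if page_rank is None:
        return 0.0
    if page_rank >= 9:
        return 100.0
    if page_rank >= 7:
        return 87.7
    if page_rank >= 5:
        return 24.46
    if page_rank >= 3:
        return 9.35
    return 0.0


def credibility_score(url, age_years, sentiment, page_rank):
    """Unweighted mean of the four per-criterion point values."""
    parts = [
        type_points(url),
        date_points(age_years),
        SENTIMENT_POINTS.get(sentiment, 0.0),
        rank_points(page_rank),
    ]
    return sum(parts) / len(parts)


# Organizational site, updated within the last year, positive sentiment, rank 7.
score = credibility_score("http://www.example.org", 0.5, "positive", 7)
print(round(score, 3))  # 76.065, the value of the worked example
```

The example URL is a placeholder. Keeping the point tables as plain dictionaries makes it straightforward to substitute different weights if the PAPRIKA elicitation is repeated.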

B. Materials and Equipment

The software tools used in the development of the
prototype source evaluation tool are Python [17] and Py2exe
[18], Pattern [19], AlchemyAPI [20], the .NET Framework for
the web interface, and Microsoft Access.

C. Procedure

A Python script is developed that gets the search results
from Google, or the external links of the Wikipedia page, for
a given topic using Pattern; performs sentiment analysis using
AlchemyAPI; obtains the Google rank for each URL from
http://webinfodb.net/a/pr.php?url=<website_url>; determines
the type of website by analyzing the URL; calculates a
credibility score for each search URL based on the four
criteria; and writes the result to a database file. The script
uses multi-threading so that it can make parallel calls to
different URLs, and it is converted into an executable program
using Py2exe. Another Python script is developed that checks
the 'link integrity' of a given URL. It outputs the total
number of 'internal links', 'external links' and 'broken
links'. This script makes a call to each link available on the
given web page. If a link does not open, it is considered a
broken link. If the link is '#', it is a self-link. If a link
points to a web page in the same domain, it is an internal
link, and if it goes to another domain, it is an external
link. A web interface is developed that uses the above Python
scripts for back-end processing.
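
The link classification rules described above can be sketched with the Python standard library as follows. This is a sketch under assumptions, not the script used in the prototype: the names LinkCollector, classify_links and is_reachable are hypothetical, the reachability check is passed in by the caller because it needs a network request, "same domain" is approximated by an exact host name match, and a broken link is counted only as broken rather than also as internal or external.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def classify_links(page_url, html, is_reachable):
    """Count self, internal, external and broken links on one page.

    is_reachable is a caller-supplied function (hypothetical) that returns
    True when a URL opens.
    """
    parser = LinkCollector()
    parser.feed(html)
    page_host = urlparse(page_url).hostname
    counts = {"self": 0, "internal": 0, "external": 0, "broken": 0}
    for href in parser.hrefs:
        if href.startswith("#"):
            # '#' and other fragment-only links point back to the same page.
            counts["self"] += 1
            continue
        target = urljoin(page_url, href)  # resolve relative links
        parsed = urlparse(target)
        if parsed.scheme not in ("http", "https"):
            continue  # assumption: mailto:, javascript: etc. are ignored
        if not is_reachable(target):
            counts["broken"] += 1
        elif parsed.hostname == page_host:
            counts["internal"] += 1
        else:
            counts["external"] += 1
    return counts


# Placeholder page and a fake reachability check, for illustration only.
sample_html = (
    '<a href="#">top</a>'
    '<a href="/about">about</a>'
    '<a href="http://other.example/x">elsewhere</a>'
    '<a href="/gone">missing</a>'
)
result = classify_links(
    "http://site.example/index.html",
    sample_html,
    lambda url: not url.endswith("/gone"),
)
print(result)
```

In a real run, is_reachable would issue an HTTP request for each target and treat errors or timeouts as a link that does not open. Passing it in as a function keeps the classification logic testable without network access.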

IV. Results

A. Output of the Source Evaluation Tool

The output of the tool for the search topic "Tourism in
India" is shown in Figure 1. The user is also allowed to edit
the results or download them as a database file.

B. Empirical Validity

A small group of seven people was asked to give credibility
scores (1-10) for nine websites. The pilot study was
conducted by sending an email to the group. The participants
were provided with the URLs of the nine websites and a short
introduction to the factors affecting credibility. To avoid
biasing the opinions of the group, the introduction (in the
form of questions and answers) was kept brief. The results
were compared with the scores provided by the developed tool.
The scores given by the participants and the tool are shown in
Table II. Based on the mean of the scores given by the
participants, a correlation coefficient of 0.484 (p < .19) is
obtained.
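
The text does not state which correlation coefficient was used. Assuming it is Pearson's r between the mean participant rating and the tool score of each website, the computation and the associated t statistic can be sketched as follows. The function names are hypothetical, and the short input lists are illustrative numbers only, not the study data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def t_statistic(r, n):
    """t value for testing r against zero, with n - 2 degrees of freedom."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


# Illustrative numbers only, not the ratings collected in the pilot study.
human_means = [6.0, 7.0, 5.0, 4.0]
tool_scores = [60.0, 55.0, 40.0, 20.0]
print(round(pearson_r(human_means, tool_scores), 3))

# r and n as reported in the text: 0.484 over nine websites.
t_reported = t_statistic(0.484, 9)
print(round(t_reported, 2))
```

With r = 0.484 and n = 9, the t value is about 1.46 on 7 degrees of freedom, which is in line with the reported p < .19 for a two-tailed test.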
TABLE II
SCORES GIVEN BY PARTICIPANTS AND THE TOOL

Website URL                                                          Individual ratings of       Score by tool
                                                                     participants (out of 10)    (out of 100)
http://tourism.gov.in/                                               5, 5, 5, 7, 7, 2, 10        65
http://india.gov.in/overseas/visit_india/medical_india.php           6, 4, 6, 7, 9, 9, J         56
http://www.incredibleindia.org                                       2, 7, 8, &, &, 1, &         51
http://www.jaipur.org.uk                                             7, 4, 8, 6, 8, 10, 6        35
http: wwiv.travelsgew eit.com travel asiaMo Medical-Tourism-to-Iudia  9 3 A3 J. 62                25
http://www.mumbai.org.uk                                             1, 4, 7, 6, 6, 3, 5         32
http://www.indiaholiday.org/india-tourism.html                       10, 5, 7, 5, 8, J, 8        29
http: www^grikmdsB_B                                                 4, 4, 6, 2, 6, 7, 4         14
http://www.tourism-of-india.com                                      3, 6, 7, 4, 8, 4, 7

V. Conclusions & Discussion

The prototype of the tool uses only four criteria for
assigning a credibility score to a website. The pilot study
conducted to obtain empirical validity of the tool gives a low
correlation (.48). The possible causes of this low correlation
can be summarized as follows. First, the tool is restricted to
only four criteria. If all the criteria discussed in Section
I (B) are included in the credibility score, the system may
give better results; only then can a higher correlation with
the subjects' judgments be expected. Second, the ratings
assigned by the subjects did not agree with one another.
Apparently they judged the websites on different criteria
rather than on an overall weighting. The work done in this
paper can be enhanced further by including all the criteria of
source evaluation in an automated system. Some of the criteria
may not be possible to evaluate automatically; in that case, a
database of metadata provided by experts may be used. Such a
system will help a novice user by giving him or her a level of
confidence in the reliability of a specific web document. In
addition, integrating this source evaluation tool with existing
automatic multi-document summarizers will help to achieve
better summarization.

Figure 1. Output of the tool for a searched topic